[WIP] full finetune / qlora + ac/offload/optm in bwd #21

weifengpy · 2024-02-26T20:40:51Z

Why composing FSDP with NF4Tensor

QLoRA : number of trainable parameters are reduced from xxx to xxx. parameter size are reduced by xx
Full finetuning original Llama with 4bit quantized params:

7B + QLora on FFNs: summarizing memory usage below with bf16, adamW, AC, cpu offloading

sharding NF4Tensor in FSDP: NF4Tensor are 4 bit quant weight from QLora
cpu offloading NF4Tensor in FSDP: most profitabble
gradient in the backward, 8bit optimizer: does not matter in QLora because of tiny gradable parameters. should priooritize in full training

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

awgu · 2024-02-26T22:03:03Z

Should we call out that this table assumes that we are only applying QLoRA to the FFNs?

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

drisspg · 2024-03-07T21:01:37Z

transformer_nuggets/quant/nf4_tensor.py

+def nf4_detach(aten_op, args, kwargs=None):
+    # nn.Parameter need detach
+    quantized_data = aten_op(args[0].quantized_data, *args[1:], **kwargs)
+    tensor_meta = SubclassTensorArgs(


https://github.com/drisspg/transformer_nuggets/pull/24/files#diff-6954806b06a395823616c5a0fefc77b5cf274b3f88a5381c38856cbf74d6f2bfR37-R45

I hijacked the tensor unflatten/flatten to construct the new tensors,

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

weifengpy · 2024-07-15T22:34:47Z

close is since QLoRA + FSDP2 and cpu offloading has been landed into torchtune: pytorch/torchtune#909

full finetune / qlora + ac/offload/optm in bwd

a60f6e8

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

weifengpy changed the title ~~full finetune / qlora + ac/offload/optm in bwd~~ [WIP] full finetune / qlora + ac/offload/optm in bwd Feb 26, 2024

weifengpy marked this pull request as draft February 27, 2024 20:34

sharding and cpu offloadin nf4tensor

0576139

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

drisspg reviewed Mar 7, 2024

View reviewed changes

avoid hardcoding shapes and gpus in __torch_dispatch__

8862789

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

weifengpy closed this Jul 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] full finetune / qlora + ac/offload/optm in bwd #21

[WIP] full finetune / qlora + ac/offload/optm in bwd #21

weifengpy commented Feb 26, 2024 •

edited

Loading

awgu commented Feb 26, 2024

drisspg Mar 7, 2024

weifengpy commented Jul 15, 2024 •

edited

Loading

[WIP] full finetune / qlora + ac/offload/optm in bwd #21

[WIP] full finetune / qlora + ac/offload/optm in bwd #21

Conversation

weifengpy commented Feb 26, 2024 • edited Loading

Why composing FSDP with NF4Tensor

awgu commented Feb 26, 2024

drisspg Mar 7, 2024

Choose a reason for hiding this comment

weifengpy commented Jul 15, 2024 • edited Loading

weifengpy commented Feb 26, 2024 •

edited

Loading

weifengpy commented Jul 15, 2024 •

edited

Loading